{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Benchmark over-sampling methods in a face recognition task\n\nIn this face recognition example two faces are used from the LFW\n(Faces in the Wild) dataset. Several implemented over-sampling\nmethods are used in conjunction with a 3NN classifier in order\nto examine the improvement of the classifier's output quality\nby using an over-sampler.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: Christos Aridas\n#          Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)\n\nimport seaborn as sns\n\nsns.set_context(\"poster\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Load the dataset\n\nWe will use a dataset containing image from know person where we will\nbuild a model to recognize the person on the image. We will make this problem\na binary problem by taking picture of only George W. Bush and Bill Clinton.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.datasets import fetch_lfw_people\n\ndata = fetch_lfw_people()\ngeorge_bush_id = 1871  # Photos of George W. Bush\nbill_clinton_id = 531  # Photos of Bill Clinton\nclasses = [george_bush_id, bill_clinton_id]\nclasses_name = np.array([\"B. Clinton\", \"G.W. Bush\"], dtype=object)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "mask_photos = np.isin(data.target, classes)\nX, y = data.data[mask_photos], data.target[mask_photos]\ny = (y == george_bush_id).astype(np.int8)\ny = classes_name[y]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can check the ratio between the two classes.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nimport pandas as pd\n\nclass_distribution = pd.Series(y).value_counts(normalize=True)\nax = class_distribution.plot.barh()\nax.set_title(\"Class distribution\")\npos_label = class_distribution.idxmin()\nplt.tight_layout()\nprint(f\"The positive label considered as the minority class is {pos_label}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We see that we have an imbalanced classification problem with ~95% of the\ndata belonging to the class G.W. Bush.\n\n## Compare over-sampling approaches\n\nWe will use different over-sampling approaches and use a kNN classifier\nto check if we can recognize the 2 presidents. The evaluation will be\nperformed through cross-validation and we will plot the mean ROC curve.\n\nWe will create different pipelines and evaluate them.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn import FunctionSampler\nfrom imblearn.over_sampling import ADASYN, RandomOverSampler, SMOTE\nfrom imblearn.pipeline import make_pipeline\nfrom sklearn.neighbors import KNeighborsClassifier\n\nclassifier = KNeighborsClassifier(n_neighbors=3)\n\npipeline = [\n    make_pipeline(FunctionSampler(), classifier),\n    make_pipeline(RandomOverSampler(random_state=42), classifier),\n    make_pipeline(ADASYN(random_state=42), classifier),\n    make_pipeline(SMOTE(random_state=42), classifier),\n]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import StratifiedKFold\n\ncv = StratifiedKFold(n_splits=3)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will compute the mean ROC curve for each pipeline using a different splits\nprovided by the :class:`~sklearn.model_selection.StratifiedKFold`\ncross-validation.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import RocCurveDisplay, roc_curve, auc\n\ndisp = []\nfor model in pipeline:\n    # compute the mean fpr/tpr to get the mean ROC curve\n    mean_tpr, mean_fpr = 0.0, np.linspace(0, 1, 100)\n    for train, test in cv.split(X, y):\n        model.fit(X[train], y[train])\n        y_proba = model.predict_proba(X[test])\n\n        pos_label_idx = np.flatnonzero(model.classes_ == pos_label)[0]\n        fpr, tpr, thresholds = roc_curve(\n            y[test], y_proba[:, pos_label_idx], pos_label=pos_label\n        )\n        mean_tpr += np.interp(mean_fpr, fpr, tpr)\n        mean_tpr[0] = 0.0\n\n    mean_tpr /= cv.get_n_splits(X, y)\n    mean_tpr[-1] = 1.0\n    mean_auc = auc(mean_fpr, mean_tpr)\n\n    # Create a display that we will reuse to make the aggregated plots for\n    # all methods\n    disp.append(\n        RocCurveDisplay(\n            fpr=mean_fpr,\n            tpr=mean_tpr,\n            roc_auc=mean_auc,\n            estimator_name=f\"{model[0].__class__.__name__}\",\n        )\n    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the previous cell, we created the different mean ROC curve and we can plot\nthem on the same plot.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, ax = plt.subplots(figsize=(9, 9))\nfor d in disp:\n    d.plot(ax=ax, linestyle=\"--\")\nax.plot([0, 1], [0, 1], linestyle=\"--\", color=\"k\")\nax.axis(\"square\")\nfig.suptitle(\"Comparison of over-sampling methods \\nwith a 3NN classifier\")\nax.set_xlim([0, 1])\nax.set_ylim([0, 1])\nsns.despine(offset=10, ax=ax)\nplt.tight_layout()\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We see that for this task, methods that are generating new samples with some\ninterpolation (i.e. ADASYN and SMOTE) perform better than random\nover-sampling or no resampling.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}